This entry documents the first observability experiment I've run against the LLM Workflow Router project: attempting to instrument a deterministic, content-blind workflow enforcement layer using the emerging OpenTelemetry GenAI semantic conventions.
What made the experiment interesting is that the Router itself does not perform any LLM operations. It does not call a provider API, inspect prompts, generate completions, retrieve context, or execute tools. It evaluates structured metadata against a declared workflow topology and returns a terminal structural decision.
In practice, that meant the Router sat directly at the edge of the current semantic model. The instrumentation worked — but only by repeatedly stepping outside the vocabulary the conventions currently provide.
The system under instrumentation
The LLM Workflow Router is a stateless middleware engine that enforces explicit execution topology in AI systems relying on large language models. It evaluates interaction metadata against strictly declared workflow rules and returns a terminal state.
The term “LLM router” is overloaded in this ecosystem. Most public projects under that name are model-selection routers — systems that pick between cheap and expensive models based on query complexity, task classification, or cost.
This Router is not that. It is a topology-enforcement layer: it validates whether a given workflow transition is permitted, independent of model choice or content. The two share a name but solve different problems.
Its responsibilities are intentionally narrow:
- Validate workflow configuration structure at load time.
- Evaluate whether transitions between workflow containers are permitted.
- Enforce invocation limits and topology constraints.
- Return a terminal decision state.
Importantly, the Router is deliberately content-blind. It does not observe prompts or model outputs. It observes only metadata and workflow structure.
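To make the content-blind stance concrete, here is a minimal sketch of what a metadata-only evaluation could look like. It is not the Router's actual implementation: the `Decision` type, the `evaluate` signature, the topology format, and the "ALLOWED" / "REFUSED" labels are all assumptions made for illustration. Only the refusal reason strings come from this entry (they are listed under F4), and the sketch covers three of the four.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Decision:
    state: str                    # "ALLOWED" / "REFUSED" are hypothetical labels
    reason: Optional[str] = None  # typed refusal reason, None when allowed


class WorkflowEngine:
    """Toy stand-in for the Router's engine; the signature is an assumption."""

    def __init__(self, topology: Dict[str, FrozenSet[str]], max_invocations: int):
        # The engine holds only the declared topology and limits, no per-call state.
        self._topology = topology
        self._max_invocations = max_invocations

    def evaluate(self, container: str, requested: str, invocations: int) -> Decision:
        # Only structural metadata comes in: no prompt, completion, or tool payload.
        if container not in self._topology or requested not in self._topology:
            return Decision("REFUSED", "UNKNOWN_CONTAINER")
        if requested not in self._topology[container]:
            return Decision("REFUSED", "INVALID_TRANSITION")
        if invocations >= self._max_invocations:
            return Decision("REFUSED", "MAX_INVOCATION_EXCEEDED")
        return Decision("ALLOWED")


# Hypothetical three-container workflow: intake -> draft -> review.
engine = WorkflowEngine(
    topology={
        "intake": frozenset({"draft"}),
        "draft": frozenset({"review"}),
        "review": frozenset(),
    },
    max_invocations=3,
)

print(engine.evaluate("intake", "draft", invocations=0))
print(engine.evaluate("draft", "intake", invocations=0))
```

Nothing in the function signature can carry user content, which is the property the rest of this entry depends on.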
Two operations were wrapped in OpenTelemetry spans: validate_workflow_config() and WorkflowEngine.evaluate().
Existing GenAI semantic conventions were used wherever possible. Every mismatch between the Router's semantics and the available conventions was recorded as a friction point.
Friction points observed
F1. No operation name exists for structural validation
The first mismatch appeared immediately during configuration loading.
The Router validates workflow topology before execution begins, but the current gen_ai.operation.name values all describe runtime inference activities:
- chat
- generate_content
- invoke_agent
- retrieval
- execute_tool
None describe static structural validation.
The only accurate representation was a non-standard value:
gen_ai.operation.name = "validate_workflow_config"
This exposed a deeper assumption embedded in the conventions: that meaningful GenAI observability begins only once runtime model activity begins.
For deterministic orchestration systems, much of the actual safety work happens before execution ever starts.
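As a rough illustration of load-time validation, and of the attributes such a span might carry, consider the sketch below. The config shape (a dict mapping each container to its permitted targets) is hypothetical, a plain dict stands in for a real span, and using workflow.result.state (the custom attribute this entry introduces under F3) with "VALID" / "INVALID" values for the load-time outcome is my assumption, not something the Router is confirmed to do.

```python
from typing import Dict, List


def validate_workflow_config(config: Dict[str, List[str]]) -> List[str]:
    """Return structural problems found at load time; an empty list means valid."""
    problems = []
    if not config:
        problems.append("no containers declared")
    for container, targets in config.items():
        for target in targets:
            # Every transition target must itself be a declared container.
            if target not in config:
                problems.append(f"{container} -> {target}: undeclared container")
    return problems


def validation_span_attributes(problems: List[str]) -> Dict[str, str]:
    """Attributes a validation span might carry (a plain dict stands in for a span)."""
    return {
        # Non-standard value: none of the listed operation names describe this step.
        "gen_ai.operation.name": "validate_workflow_config",
        # Hypothetical reuse of the custom result attribute for load-time outcomes.
        "workflow.result.state": "INVALID" if problems else "VALID",
    }


# Hypothetical configs: container name -> permitted next containers.
good = {"intake": ["draft"], "draft": ["review"], "review": []}
bad = {"intake": ["draft"], "draft": ["publish"]}

print(validate_workflow_config(good))
print(validate_workflow_config(bad))
print(validation_span_attributes(validate_workflow_config(bad)))
```

All of this runs before any model is invoked, so the span has no runtime inference activity to describe.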
F2. gen_ai.provider.name is structurally inapplicable
The Router has no provider.
It does not call OpenAI, Anthropic, Gemini, local inference servers, or any external model endpoint.
Yet the conventions strongly imply that a provider exists for all GenAI-related spans.
Both spans omitted the attribute entirely, which leaves them technically incomplete relative to the current semantic model.
This is fine when every GenAI-related span sits inside a model-invocation context, but it presumes that the only systems in scope are model invokers. A different category (orchestration, governance, and policy-enforcement layers that sit adjacent to models rather than calling them) has no clear home in the current conventions, and this Router is one instance of it.
Comparable systems include LangGraph's graph-execution layer, Temporal-style workflow engines wrapping LLM steps, and guardrail frameworks like NVIDIA NeMo Guardrails that gate execution without generating content. Each shares the property of being structurally involved in AI execution while being inferentially uninvolved.
F3. Workflow topology has no first-class semantic vocabulary
The Router evaluates transitions between containers in a workflow topology.
To instrument that meaningfully, I needed to record:
- Current container
- Requested transition target
- Terminal decision outcome
No current gen_ai.* attribute represents those concepts.
The closest available attribute, gen_ai.conversation.id, describes conversational identity rather than structural execution position.
Using conversation semantics for topology semantics would have produced misleading telemetry, so custom attributes were introduced instead:
workflow.container
workflow.requested_action
workflow.result.state
This was one of the clearest points where the semantic model showed its conversational bias.
F4. Refusals are structurally meaningful, but semantically under-described
The Router emits typed structural refusal reasons:
- UNKNOWN_CONTAINER
- INVALID_TRANSITION
- MAX_INVOCATION_EXCEEDED
- CONTAINER_REENTRY_BLOCKED
These are not operational failures. They are deliberate enforcement outcomes.
The existing semantic conventions contain error.type, but that attribute describes failures of execution: timeouts, API errors, transport problems, and exceptions.
A workflow refusal is different.
The Router is operating correctly when it refuses an invalid transition. Refusal is a successful enforcement outcome, not an error state.
The only workable solution was another custom attribute:
workflow.result.reason
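The distinction between a refusal and a failure can be sketched as follows. This is an assumption-laden illustration rather than the Router's instrumentation: `outcome_attributes` is a hypothetical helper, the `(state, reason)` return shape and the "REFUSED" label are invented for the example, and a plain dict stands in for span attributes. The point is only that a refusal populates workflow.result.reason while error.type stays reserved for cases where evaluation itself broke.

```python
from typing import Callable, Dict, Optional, Tuple


def outcome_attributes(
    container: str,
    requested_action: str,
    run: Callable[[], Tuple[str, Optional[str]]],
) -> Dict[str, str]:
    """Build span-style attributes; `run` returns a hypothetical (state, reason) pair."""
    attrs = {
        "workflow.container": container,
        "workflow.requested_action": requested_action,
    }
    try:
        state, reason = run()
    except Exception as exc:
        # The evaluation itself broke: this is the case error.type describes.
        attrs["error.type"] = type(exc).__qualname__
        return attrs
    attrs["workflow.result.state"] = state
    if reason is not None:
        # A refusal is a successful enforcement outcome, so error.type stays unset.
        attrs["workflow.result.reason"] = reason
    return attrs


def crash() -> Tuple[str, Optional[str]]:
    raise TimeoutError("config store unreachable")  # hypothetical failure


refused = outcome_attributes("draft", "intake", lambda: ("REFUSED", "INVALID_TRANSITION"))
failed = outcome_attributes("draft", "review", crash)

print(refused)
print(failed)
```

With this split, a dashboard can count refusals without inflating error rates, and real failures remain visible through error.type.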
F5. The conventions assume content observability as the primary mode
One of the most interesting frictions was not a missing attribute, but a missing category of system entirely.
The current GenAI semantic conventions are heavily centered around:
- Input messages
- Output messages
- Prompt content
- Completions
- Tool calls
- Retrieval context
The Router observes none of those things by design.
Its observability stance is intentionally structural rather than semantic. It traces metadata, topology, transition legality, and enforcement outcomes while remaining blind to user content entirely.
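That distinction matters for both privacy and system architecture: telemetry that never contains prompts or completions cannot leak them, and an enforcement layer that never reads content stays decoupled from any particular model or provider.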
Current conventions implicitly treat content observability as the default shape of GenAI instrumentation. The Router demonstrated that another class of AI-adjacent systems exists: systems whose primary concern is structural governance rather than inference visibility.
F6. Standalone workflow evaluation has no clear trace-parenting model
Each Router evaluation generated its own isolated trace:
parent_id: null
This happened because there was no upstream model invocation span to inherit context from.
The current conventions strongly imply that workflow spans typically exist as children of model or agent execution spans.
But deterministic workflow enforcement can exist entirely independently of any model runtime.
In practice, this meant there was no obvious semantic guidance for how configuration validation spans and subsequent workflow evaluations should relate to one another structurally.
The result was observability fragmentation: every decision became an isolated trace instead of part of a larger structural lifecycle.
The deeper issue underneath the frictions
Individually, each mismatch seems small.
Together, though, they point toward a broader pattern: the current semantic conventions implicitly model GenAI systems primarily as systems of generation.
Prompts go in. Completions come out. Tools are called. Retrieval augments context.
That is a valid model for a large portion of the ecosystem. But it is incomplete for a class of systems that already exists: layers that participate in AI execution pipelines without themselves generating anything.
- Graph-execution and workflow orchestration engines (e.g. LangGraph).
- General-purpose workflow runtimes wrapping LLM steps (e.g. Temporal).
- Safety and policy gates (e.g. NVIDIA NeMo Guardrails).
- Validation and topology-enforcement layers, like the Router described here.
- Execution governors and deterministic orchestration runtimes more broadly.
These systems still participate directly in AI execution pipelines. They still need observability. But their semantics are structural rather than conversational.
The current conventions can technically accommodate them through custom attributes, but only awkwardly and inconsistently — as the six friction points above illustrate.
That awkwardness is useful.
Friction logs are valuable precisely because they reveal the shape of the assumptions hidden inside a specification.
In this case, the experiment suggests that the GenAI semantic conventions may eventually need a clearer distinction between:
- Inference semantics
- Structural orchestration semantics
- Governance and enforcement semantics
Right now those categories are blurred together under a model-centric view of AI systems.
The Router made the edges of that model visible.
Why this matters to me
One reason I care about observability work is that I increasingly think legibility is one of the central engineering problems of the AI era.
Not just model interpretability in the academic sense, but operational legibility:
- Can humans understand what a system is doing?
- Can they reconstruct why a decision happened?
- Can they audit transitions and enforcement behavior?
- Can they see where structure failed before harm compounds?
Deterministic workflow systems are one attempt to make AI behavior more structurally accountable.
Observability is what makes those structures visible once they exist.
Each of the six frictions above is, in its own small way, a place where that visibility currently breaks: a structural validation step that has no name, a refusal that gets misread as an error, a transition that has no vocabulary, a governance trace that floats free of any parent. Each is a place where the system did the right thing and the telemetry could not quite say so.
That's part of why this friction log felt worth writing down. The interesting thing wasn't that the instrumentation failed — it mostly worked.
The interesting thing was what the points of friction revealed about how the ecosystem currently imagines AI systems in the first place.
And for an “in development” specification, that's exactly the kind of edge case worth exploring early. 🌙